I INTRODUCTION


The purpose of statistics, like that of geometry or phy- 
sics, is to describe certain real phenomena. The objects of 
the real world can never be described in such a complete and 
exact way that they could form the basis of an exact theory. 
We have to replace them by some idealized objects, defined ex- 
plicitly or implicitly by a system of axioms. For instance, in 
geometry we define the basic notions "point," "straight line,"
and "plane" implicitly by a system of axioms. They take the 
place of empirical points, straight lines and planes which are 
not capable of exact definition. In order to apply the theory 
to real phenomena, we need some rules for establishing the cor- 
respondence between the idealized objects of the theory and 
those of the real world. These rules will always be somewhat 
vague and can never form a part of the theory itself. 

The purpose of statistics is to describe certain aspects
of mass phenomena and repetitive events. The fundamental
notion used is that of "probability." In the theory it is
defined either explicitly or implicitly by a system of axioms.
For instance, Mises¹⁾ defines the probability of an event as
the limit of the relative frequency of this event in an
infinite sequence of trials satisfying certain conditions. This
is an explicit definition of probability. Kolmogoroff²⁾ defines
probability as a set function which satisfies a certain system
of axioms. These idealized mathematical definitions are
related to the applications of the theory by translating the
statement "the event E has the probability p" into the
statement "the relative frequency of the event E in a long
sequence of trials is approximately equal to p." This
translation of a theoretical statement into an empirical
statement is necessarily somewhat vague, for we have said
nothing about the meanings of the words "long" or
"approximately." But such vagueness is always associated with
the application of theory to real phenomena.

1) See references 10 and 11.

2) See reference 9.

It should be remarked that instead of the above translation
of the word "probability" it is satisfactory to use the
following somewhat simpler one: "The event E has a probability
near to one" is translated into "it is practically certain that
the event E will occur in a single trial." In fact, if an event
E has the probability p then, according to a theorem of
Bernoulli, the probability that the relative frequency of E in
a sequence of trials will be in a small neighborhood of p is
arbitrarily near to 1 for a sufficiently long sequence of
trials. If we translate the expression "probability nearly 1"
into "practical certainty," we obtain the statement "it is
practically certain that the relative frequency of E in a long
sequence of trials will be in a small neighborhood of p."
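The content of Bernoulli's theorem can be made concrete with a small simulation. The sketch below is only an illustration under assumed values: the function names, the neighborhood width eps, the number of sequences, and the seed are all choices made here, not part of the text. It estimates how often the relative frequency of E falls within eps of p, for short and for long sequences of trials.

```python
import random


def relative_frequency(p, n_trials, rng):
    """Relative frequency of an event of probability p in n_trials independent trials."""
    hits = sum(1 for _ in range(n_trials) if rng.random() < p)
    return hits / n_trials


def share_within(p, n_trials, eps, n_sequences, seed=0):
    """Fraction of simulated sequences whose relative frequency lies within eps of p.

    eps, n_sequences and seed are illustrative assumptions, not values from the text.
    """
    rng = random.Random(seed)
    inside = sum(
        1
        for _ in range(n_sequences)
        if abs(relative_frequency(p, n_trials, rng) - p) < eps
    )
    return inside / n_sequences


if __name__ == "__main__":
    for n in (10, 100, 2000):
        print(n, share_within(0.5, n, 0.05, 200))
```

For a fixed neighborhood, the printed share should approach 1 as the sequences get longer, which is the sense in which "probability nearly 1" is read as "practical certainty."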

In statistics we always construct some probability schemes
which we believe to be adequate to describe certain real
phenomena. For instance, we describe the situation concerning
the possible outcomes in tossing a coin by saying that the
probability of obtaining a head in one toss is 1/2, for in a
long sequence of trials we would expect to have about half as
many heads as total tosses. Or, if we measure the length of a
bar by some instrument, we sometimes assume that the result is
a normally distributed random variable. The notions of a random
variable and a distribution function are defined as follows:
if F(x) is a function expressing the probability that a real
variable X < x, we say that X is a random variable and that
F(x) is the probability distribution of X. Then, if F(x) is
given by the formula

(1)   F(x) = \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{x} e^{-(t-\mu)^2 / 2\sigma^2} \, dt

we say that X is normally distributed. The quantities σ and μ
are real parameters. Thus, if in measuring the length of a bar
by some instrument we assume that the outcome of the
measurement is a normally distributed random variable, we may
express the probability that a measurement will be less than a
given value x by (1).

If X_1, X_2, X_3, ..., X_n represent n random variables and
x_1, x_2, ..., x_n any set of real numbers, we use the symbol
F(x_1, x_2, ..., x_n) to express the probability of the
composite event that X_1 < x_1, X_2 < x_2, ..., X_n < x_n
simultaneously. This function will be called the joint
probability distribution of the n random variables. We shall
say that n random variables are independently distributed if
the function F(x_1, x_2, ..., x_n) is the product of n
functions such that only x_1 is involved in the first, only x_2
in the second, and so on. That is

      F(x_1, x_2, \ldots, x_n) = f_1(x_1) \, f_2(x_2) \cdots f_n(x_n).

For example, if n measurements X_1, X_2, ..., X_n of a bar are
independently and normally distributed with the same normal
distribution, we would obtain

(2)   F(x_1, x_2, \ldots, x_n) = \prod_{\alpha=1}^{n} \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{x_\alpha} e^{-(t-\mu)^2 / 2\sigma^2} \, dt

If we measure the length of a bar n times by some instrument,
we sometimes find it appropriate to adopt the probability
scheme that the results of the n measurements have a joint
probability distribution given by (2).
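The product structure of the joint distribution of independent, identically and normally distributed measurements is easy to evaluate numerically. The following is a minimal sketch, assuming Python's standard-library `statistics.NormalDist` for the one-dimensional normal distribution function; the function name `joint_normal_cdf` is introduced here for illustration only.

```python
from statistics import NormalDist


def joint_normal_cdf(xs, mu, sigma):
    """Joint distribution function of independent N(mu, sigma^2) measurements.

    Returns the probability that X_1 < xs[0], ..., X_n < xs[n-1] simultaneously,
    computed as the product of the n one-dimensional normal distribution functions.
    """
    marginal = NormalDist(mu, sigma)
    prob = 1.0
    for x in xs:
        prob *= marginal.cdf(x)
    return prob


if __name__ == "__main__":
    # Two measurements, both required to fall below the common mean.
    print(joint_normal_cdf([10.0, 10.0], mu=10.0, sigma=0.2))
```

Because each factor depends on one coordinate only, the joint probability for two measurements both lying below the mean is the product 1/2 times 1/2.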

One of the fundamental problems of statistical inference
is that of testing statistical hypotheses. The most general
form of a statistical hypothesis we have to deal with in
statistical theory may be expressed as follows. Let X_1,...,X_n
be a finite set of random variables and let F(x_1,...,x_n) be
its joint probability distribution function. Then the
statistical hypothesis is the statement that the unknown
distribution function F(x_1,...,x_n) is an element of a certain
class ω of distribution functions. For instance, if x_1,...,x_n
are successive measurements on the length of a bar, we may
consider the hypothesis that X_1,...,X_n are independently
distributed with the same normal distribution. In this case ω
is a two-parameter family given by (2), σ being any positive
number and μ any real number.


If we consider the hypothesis that X_1,...,X_n are normally,
independently distributed with zero means (μ = 0) and unit
variances (σ² = 1), then ω consists of a single element. When
the class ω consists of a single element, we shall say that the
hypothesis we are considering is a simple hypothesis.
Otherwise, it will be called composite.

The question of testing a given hypothesis may be formulated
in the following manner. We should like to know, on the basis
of n observations x_1,...,x_n, where x_α is the observed value
of the random variable X_α (α = 1,...,n), whether to accept or
reject the hypothesis H_ω that the unknown distribution
function F(x_1,...,x_n) belongs to the class ω. The set of n
observations can be represented by a point E of n-dimensional
Cartesian space, called the sample space. To test the
hypothesis H_ω on the basis of n observations we must choose a
subset R of the sample space and then reject the hypothesis H_ω
if the sample point E falls within R. Otherwise, we maintain
the hypothesis. It is evident that the fundamental problem here
is the choice of the subset R, which we shall call the critical
region. The solution of this problem depends, to some extent,
upon any a priori knowledge we may have about the unknown
distribution function F(x_1,...,x_n). One of the most important
and most frequent a priori assumptions is that the random
variables X_1,...,X_n are independently distributed, each
having the same distribution. Thus, we have the assumption that
F is of the form

      F(x_1, \ldots, x_n) = \prod_{i=1}^{n} F_i(x_i),  where F_i = F_j for all i, j.


Such a priori knowledge about our unknown distribution
function can always be expressed by saying that the function
F(x_1,...,x_n) is an element of a certain class Ω of
distribution functions. The class ω which is being considered
is then always a subclass of Ω. We shall see that the choice of
the critical region R for testing the hypothesis H_ω will
depend upon the a priori knowledge Ω.


It is now seen that the problem of testing hypotheses can
be formulated as follows: Taking for granted that the unknown
distribution function F is an element of a class Ω, we wish
to test the hypothesis that F belongs to a certain subclass ω
of Ω. The problem to be solved is the question of how the
critical region in the sample space should be chosen.


For instance, Ω may be defined by the statement that
X_1,...,X_n are independently and normally distributed, each of
them having the same distribution, and ω may be the subclass of
Ω defined by the additional restriction that the mean values
of X_1,...,X_n are zero. In this case, according to certain
standards we will discuss later, the adequate critical region
is given by the inequality

      \left| \frac{\sqrt{n}\,\bar{x}}{s} \right| \geq c

where \bar{x} = \frac{1}{n} \sum_{\alpha} x_\alpha and
s^2 = \frac{1}{n-1} \sum_{\alpha} (x_\alpha - \bar{x})^2,
and c is a certain constant. If, however, Ω is a much broader
class defined by the statement that X_1,...,X_n are
independently distributed each having the same distribution,
the above critical region for testing H_ω is not adequate, and
some other critical region has to be chosen.
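A critical region of this kind is simply a rule that maps a sample point to "reject" or "maintain." The sketch below assumes the usual Student ratio, the square root of n times the sample mean divided by the sample standard deviation, with the sum of squared deviations divided by n - 1; the scan of the original formula is damaged, so that normalization is an assumption, as are the function names. The constant c is left as an argument because its value depends on n and on the desired size of the region.

```python
import math


def student_ratio(xs):
    """sqrt(n) * mean / s, with s^2 the sum of squared deviations over n - 1 (assumed)."""
    n = len(xs)
    mean = sum(xs) / n
    s2 = sum((x - mean) ** 2 for x in xs) / (n - 1)
    return math.sqrt(n) * mean / math.sqrt(s2)


def in_critical_region(xs, c):
    """True if the sample point lies in the region |sqrt(n) * mean / s| >= c."""
    return abs(student_ratio(xs)) >= c


if __name__ == "__main__":
    sample = [1.0, 2.0, 3.0]
    # 4.303 is the tabulated two-sided 5 percent point of Student's t with 2 degrees
    # of freedom; it is quoted here only to give the sketch a concrete constant.
    print(student_ratio(sample), in_critical_region(sample, c=4.303))
```

The rule uses only the class Ω of normal distributions through the choice of the ratio and of c, which is why it loses its justification when Ω is enlarged.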


Before we proceed further it might be well for us to list
a few of the mathematical terms used together with their
meanings in statistics. We can do this in tabular form.


MATHEMATICAL TERMINOLOGY            STATISTICAL INTERPRETATION

n-space E_n (sample space)          Possible outcomes of n
                                    observations.

Ω, class of functions on E_n        Class of possible probability
                                    distributions.

ω, subclass of Ω                    The statistical hypothesis: the
                                    true distribution is a member
                                    of ω.

R (critical region), a subset       Criterion for rejecting the
of E_n                              hypothesis that the true
                                    distribution is a member of ω.

Association of R with Ω and ω       Choice of the critical region
                                    for testing the hypothesis.


The problem of testing hypotheses is only one of the
problems of statistical inference. Another is the problem of
estimation. Given that the unknown distribution function F
belongs to a certain class Ω of distribution functions, how can
we choose a function φ(E) defined for all points E of E_n such
that the value of φ(E) is always an element of Ω and can be
considered a "good" estimate of the unknown distribution
function F? We may say that φ(E) is a "good statistical
estimate" of F if the probability is as large as possible that
φ(E) is in a small neighborhood of F. We will formulate this
principle more precisely in chapter III.

If, for instance, Ω is given by the statement that
X_1,...,X_n are independently and normally distributed with the
same means and unit variances, then Ω is a one-parameter
family of distribution functions and an element of Ω is
completely specified by specifying the value of the unknown
mean μ. Hence, to estimate the unknown distribution function F
is the same as to estimate the unknown mean μ. In this case the
problem of estimation is the problem of finding a real function
φ(E) defined for all points E of the sample space such that
φ(E) can be considered as a statistical estimate of the unknown
mean μ. The classical solution of this problem in this
particular case is given by

      \varphi(E) = \frac{x_1 + \cdots + x_n}{n}.

The two types of problems of statistical inference mentioned
so far do not cover all possible problems.³⁾ The following
problem, for example, is neither a problem of testing a
hypothesis nor one of estimation: Consider three subclasses
ω_1, ω_2, ω_3 of the class Ω of distribution functions, and
denote by H_{ω_i} the hypothesis that the unknown distribution
F is an element of ω_i. The problem considered is to decide on
the basis of the n observations which of the three hypotheses
should be accepted (assume that the sum of the three subclasses
ω_1, ω_2, ω_3 is equal to Ω). Such a situation may arise, for
instance, in the case of a manufacturer who has to keep the
quality of his product between two limits, and wants to test,
by sampling, whether the quality is actually between these
limits, below the lower limit, or above the upper limit.
(Assume that the quality is measurable and can be represented
by a real number.)

3) See in this connection reference 16, pp. 299-300.


The reasons why such a "trilemma" is a problem different
from testing a hypothesis or estimation can only be indicated
here. It will be seen that there are many approaches to each
problem of inference, and that the theory provides means of
choosing among them by deciding that certain approaches are
"better" than certain others. Now, one might suggest the
reduction of the above "trilemma" to a problem of, say,
estimation by estimating the unknown distribution function F
and accepting that hypothesis which corresponds to the subclass
in which the estimate of F is contained. This would be one
answer to the trilemma, but by no means the "best" answer
according to the standards developed.


The most general formulation of the problem of statistical
inference is this: Let S be a system of subclasses of the
class Ω of distribution functions. For each element s of S,
consider the hypothesis H_s which states that the unknown
distribution F is an element of s; denote by H_S the system of
all such hypotheses; the problem is to decide, by means of a
sample, which element of H_S should be accepted.


The problems enumerated before are special cases of this
general problem. If S consists of two elements only, one being
a subclass ω of Ω and the other its complement in Ω, the
problem is the same as that of testing the hypothesis that the
true distribution function F is an element of ω. If S is the
system of all elements of Ω, we have the problem of
estimation. If S consists of three classes ω_1, ω_2, ω_3 with
the sum Ω, we have the trilemma.


II THE NEYMAN-PEARSON THEORY OF TESTING A STATISTICAL HYPOTHESIS⁴⁾

The principles of statistical inference as developed in the
last two decades by R. A. Fisher, Neyman and Pearson deal with
the problem of testing a hypothesis and with the problem of
estimation but not with the general problem of statistical
inference as it has been formulated in the foregoing pages. A
further restriction in these theories is that they deal only
with the case that Ω is a k-parameter family of distribution
functions, i.e., that the true but unknown distribution
function F is known to be an element of a k-parameter family of
functions

      F(x_1, x_2, \ldots, x_n, \theta_1, \theta_2, \ldots, \theta_k)

where θ_1,...,θ_k are parameters. In this case the
specification of the values of the parameters specifies
completely the distribution function F.

A set of parameter values can be represented by a point in
a k-dimensional Euclidean space called a parameter space.
Because of the one-to-one correspondence between elements of Ω
and points of the parameter space we can identify Ω with the
parameter space. If, for example, X_1,...,X_n are normally and
independently distributed, each having the same distribution
(equation (2)), then the parameter space is a half plane where
θ_1 = μ = mean value, and 0 < θ_2 = σ = standard deviation.

A hypothesis concerning F is expressed by the statement
that the true parameter point lies in a certain subset ω of the
parameter space Ω. As we have done before, we shall call the
hypothesis a simple one if ω consists of a single point.
Otherwise, it is called a composite hypothesis. In the above
example the statement that μ = 0, σ = 1 is a simple hypothesis,
while merely stating that μ = 0 without specifying σ is a
composite hypothesis.

4) See, in this connection, references 12, 13 and 14.

For the sake of simplicity we shall confine ourselves to
the case of a single unknown parameter since this suffices to
illustrate the basic ideas of the theories of Fisher, Neyman
and Pearson. First, we shall deal with the Neyman-Pearson
theory of testing a statistical hypothesis.

We assume that the unknown distribution function is known
to be an element of a one-parameter family
F(x_1, x_2, ..., x_n, θ) and we wish to test the hypothesis
θ = θ_0.

A simple example for this case is the following: Let it
be known that X_1,...,X_n are independently and normally
distributed with the same mean and unit variances, i.e., Ω is
the one-parameter family of distributions

      F(x_1, \ldots, x_n, \theta) = \frac{1}{(2\pi)^{n/2}} \int_{-\infty}^{x_1} \cdots \int_{-\infty}^{x_n} e^{-\frac{1}{2} \sum_{\alpha=1}^{n} (t_\alpha - \theta)^2} \, dt_1 \cdots dt_n

and assume that we wish to test the hypothesis that θ = 0.
According to the classical theory we reject this hypothesis if
and only if

      |\bar{x}| \geq c,   where   \bar{x} = \frac{x_1 + \cdots + x_n}{n}

and c denotes a certain constant. The value of c is chosen
in such a way that the probability of |x̄| ≥ c, under the
assumption that the hypothesis θ = 0 is true, is so small that
we are willing to reject the hypothesis. If we want this
probability to be 5 percent, then c = 1.96/\sqrt{n}.
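The constant c for this classical test follows from the fact that, under the hypothesis, the arithmetic mean of n unit-variance normal observations is normal with mean zero and variance 1/n. A minimal sketch, assuming Python's standard-library `statistics.NormalDist`; the function names are illustrative and not from the text.

```python
from math import sqrt
from statistics import NormalDist


def critical_constant(n, size=0.05):
    """Constant c such that P(|mean| >= c) equals `size` when the true mean is 0.

    Assumes n independent normal observations with unit variance.
    """
    z = NormalDist().inv_cdf(1.0 - size / 2.0)  # about 1.96 for size = 0.05
    return z / sqrt(n)


def rejects(xs, size=0.05):
    """Classical rule: reject the hypothesis of zero mean if |mean| >= c."""
    mean = sum(xs) / len(xs)
    return abs(mean) >= critical_constant(len(xs), size)


if __name__ == "__main__":
    print(critical_constant(1), critical_constant(2), critical_constant(100))
    print(rejects([1.5, 1.4]), rejects([0.1, -0.2]))
```

The constant shrinks like one over the square root of n, so a fixed departure of the mean from zero is eventually rejected as the number of observations grows.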


If, in the same example, we have made only two observations
x_1, x_2, so that the sample space is the Euclidean plane,
the critical region consists of all points for which
(x_1 + x_2)/2 ≥ 1.96/√2 and all points for which
(x_1 + x_2)/2 ≤ -1.96/√2. If the point representing the
observations falls within the critical region (i.e., if the
arithmetic mean of the two observations is larger than 1.96/√2
or smaller than -1.96/√2) we shall reject the hypothesis that
the mean value is zero.

But the classical theory does not suggest why this critical
region should be used. It merely proves that the probability
for the observation point to fall within the critical region
is five percent when the initial hypothesis is fulfilled. But
there are infinitely many regions which enjoy the same property,
and the classical theory does not give any reasons why just the
one region mentioned should be chosen.

In order to arrive at a distinction between various critical
regions, Neyman and Pearson advance the following
considerations. In making a statement of acceptance or
rejection of a hypothesis, we may commit two types of errors:
rejecting the hypothesis although it is true (error of type I),
or failing to reject it although it is false (error of type
II). If the hypothesis consists in saying that the unknown
parameter θ has a given value θ_0, the situation may be
summarized as follows:


        Truth or Falsehood of Statement
        Concerning the Hypothesis θ = θ_0

True                    Statement Advanced
Situation         θ = θ_0                  θ ≠ θ_0

θ = θ_0           true                     false (type I error)

θ ≠ θ_0           false (type II error)    true


By size of the critical region we mean the probability that the
point representing the observations will fall within the
critical region, where the probability in question is
calculated under the assumption that the hypothesis is true.
(Thus, in the example used before, the size of the critical
region was five percent.) This may be expressed by saying that
the size of the critical region is equal to the probability of
committing a type I error.

The general idea underlying the theory of Neyman and Pear- 
son is to minimize the probability of type II errors while keep- 
ing the probability of type I errors constant. 

If R is any region in the sample space, and E is the point
of the sample space which represents the observations, we shall
denote by P(R|θ_1) the probability of E lying in R calculated
under the assumption that θ_1 is the true value of the unknown
parameter θ, that is to say, P(R|θ_1) is equal to the Stieltjes
integral \int_R dF(x_1, \ldots, x_n, \theta_1) over the region
R. Thus, if we make the hypothesis θ = θ_0 and choose R as a
critical region for this hypothesis, the size of the critical
region will be given by the expression P(R|θ_0). If the
hypothesis is wrong and the true value of θ is θ_1, then the
probability of avoiding an error of type II is P(R|θ_1).

The expression P(R|θ_1), i.e., one minus the probability of
an error of type II, is called the power of the critical region
R with respect to the alternative hypothesis θ = θ_1.

The expression P(R|θ) is a function of θ. It may be plotted
as a curve, the ordinate of which is equal to the size of R
if the abscissa is θ_0, and equal to the power of R with respect
to the alternative θ = θ_1 if the abscissa is any value
θ_1 ≠ θ_0. This curve is called the power curve of the region R.

In the former example, in which the distribution was normal
with unknown mean and unit variance, and the critical region
chosen was |x̄| > 1.96/√n (where x̄ is the arithmetic mean of
the observations x_1, ..., x_n), the power curve can easily be
calculated and has the form shown below:

[Figure: power curve of the critical region |x̄| > 1.96/√n]

In order to compare the test |x̄| > 1.96/√n with other possible
tests, we have to compare the above power curve with the power
curves of other critical regions which have the same size, five
percent.
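The power curve of the two-sided region in this example has a closed form, since the arithmetic mean of n unit-variance normal observations with true mean θ is itself normal with mean θ and variance 1/n. The sketch below evaluates it with Python's standard-library `statistics.NormalDist`; the function name and the sample values of θ and n are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution, mean 0 and variance 1


def power_two_sided(theta, n, size=0.05):
    """P(R | theta) for the region |mean| > z / sqrt(n), observations N(theta, 1)."""
    z = STD_NORMAL.inv_cdf(1.0 - size / 2.0)
    shift = sqrt(n) * theta
    # Probability that the mean falls below -z/sqrt(n) plus probability that it exceeds z/sqrt(n).
    return STD_NORMAL.cdf(-z - shift) + 1.0 - STD_NORMAL.cdf(z - shift)


if __name__ == "__main__":
    for theta in (-1.0, -0.5, 0.0, 0.5, 1.0):
        print(theta, round(power_two_sided(theta, n=10), 4))
```

At θ = 0 the ordinate equals the size of the region; away from zero the curve rises symmetrically toward 1.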

In general, if we have two critical regions R and R', both
of which have the desired size, and if the power curve of R' is
above that of R for the value θ = θ_1, then the critical region
R' is better than R for testing the hypothesis if the true
value of θ happens to be θ_1. For the probability of committing
a type I error is the same whether R or R' is used, while the
probability of committing a type II error when using R' is
smaller than when using R. If the power curve of R' is above
that of R for each θ (except θ_0, for which the two curves
coincide by assumption), then R' will be called uniformly more
powerful than R. The test using the critical region R is called
non-admissible because its use is, under all circumstances,
less favorable than the use of R'.

In order to make this clear, let us assume that a large
number of samples is drawn, each of which consists of N
individual observations. Let M be the number of such samples
and let two statisticians, whom we will call S and S', test the
same hypothesis, using each of the M samples. Assume that S
uses the critical region R for testing while S' bases his tests
on the region R'. S and S' will each obtain M answers to the
question as to whether the null hypothesis (the hypothesis to
be tested) should be rejected. Some of these answers will be
right, others will be wrong. Let us compare the records of S
and S'. We have to distinguish between the case that the null
hypothesis is true and the case that it is false.

a) In the first case, the answers obtained by each statistician
may either be that the hypothesis is to be accepted - these
answers are right; or that it should be rejected - these
answers are errors of type I. The probability of committing a
type I error by testing the null hypothesis from a sample drawn
at random is equal to the size of the critical region used in
testing. If M is large, it is practically certain that the
relative frequency of type I errors will be approximately equal
to their probability, i.e., to the size of the critical region.
Since R and R' have, by assumption, equal size, each of the two
statisticians will commit approximately the same number of
errors.

b) If the null hypothesis is false, some of the M answers
obtained by each statistician will correctly reject it, while
others will accept it, thus committing errors of type II. If M
is large, the relative frequency of correct answers will be
approximately equal to the power of the test used, which, as we
have pointed out, is the probability of avoiding a type II
error. By assumption, the power of R' is greater than that of
R, regardless of what the true value of θ is, provided only
that θ is different from θ_0. Therefore, the relative frequency
of wrong answers obtained by S will tend to be greater than the
relative frequency of wrong answers obtained by S'.

Thus, if the null hypothesis is false (no matter what the true
value of θ is), it is practically certain that S will make more
false statements; while if the null hypothesis is true, S and
S' will make an approximately equal number of false statements.
The method used by S', i.e., the application of the critical
region R', is therefore superior to the method used by S, i.e.,
the application of the critical region R.
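The comparison of the two statisticians can be imitated by simulation. The sketch below rests on several assumptions that are not in the passage itself: the observations are normal with unit variance, the hypothesis is θ = 0, only alternatives θ > 0 are admitted, S uses the two-sided region |x̄| > 1.96/√n and S' uses the one-sided region x̄ > 1.64/√n discussed later in the chapter (whose size is only approximately five percent). The function name, sample counts, and seed are illustrative.

```python
import random
from math import sqrt


def rejection_frequencies(theta, n, m_samples, seed=0):
    """Relative frequency of rejections by S (region R) and by S' (region R').

    Both statisticians are given the same m_samples samples of n observations
    drawn from a normal distribution with mean theta and unit variance.
    """
    rng = random.Random(seed)
    c_two_sided = 1.96 / sqrt(n)  # R:  |mean| > 1.96 / sqrt(n)
    c_one_sided = 1.64 / sqrt(n)  # R': mean > 1.64 / sqrt(n)
    rejections_s = 0
    rejections_s_prime = 0
    for _ in range(m_samples):
        mean = sum(rng.gauss(theta, 1.0) for _ in range(n)) / n
        if abs(mean) > c_two_sided:
            rejections_s += 1
        if mean > c_one_sided:
            rejections_s_prime += 1
    return rejections_s / m_samples, rejections_s_prime / m_samples


if __name__ == "__main__":
    # Case a): hypothesis true, every rejection is a type I error.
    print(rejection_frequencies(theta=0.0, n=4, m_samples=20000))
    # Case b): hypothesis false, every rejection is a correct answer.
    print(rejection_frequencies(theta=0.5, n=4, m_samples=20000))
```

When the hypothesis is true the two rejection frequencies should be nearly equal, and when it is false with θ > 0 the frequency of correct rejections by S' should be visibly larger than that of S.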
These considerations decide the choice between two critical
regions of equal size if one of them is uniformly more
powerful than the other, i.e., if the power curve of the former
is above that of the latter for all values of θ except θ_0 (for
which the power curves coincide). On the other hand, if the
power curve of R' is above that of R for some values of θ, but
below it for other values of θ, then we cannot choose one of
the two regions without introducing further principles on which
to base the choice.

If, for all values of θ, the power curve of a region R is
never below that of any other region R' of equal size, then R
is called a uniformly most powerful region, and the test
corresponding to R a uniformly most powerful test.

If we can find a uniformly most powerful test, we shall prefer
it to all other tests using regions of the same size.
Unfortunately, uniformly most powerful tests do not exist in
most cases. In the example used above (a normal distribution
with unknown mean θ and unit variance, and the hypothesis
θ = 0), let us consider the region R' determined by the
inequality x̄ > 1.64/√n. It can easily be shown that R' (like
the region R considered before) has the size .05. The power
curves of R and R' are shown below:

[Figure: power curves of the regions R and R']

We can see that for all θ > 0, R' is more powerful than R,
and vice versa for θ < 0. In such cases further principles have
to be formulated on which the choice should be based. It is
clear that the choice we make will depend on our a priori
degree of belief in the truth of the different possible values
of θ. For instance, if we know a priori that θ cannot be
negative, then we shall prefer R'.

Moreover, it can be shown that R' is uniformly most powerful
if the parameter space is restricted to non-negative values
of θ. If negative and positive values of θ are considered a
priori as equally possible we will most likely prefer R to R'.

This example shows also that the choice of the critical
region depends essentially on Ω. If Ω consists of all
non-negative values of θ, then the region R' is a uniformly
most powerful region. If Ω consists of all non-positive values
of θ, then the region R'' given by x̄ < -1.64/√n is a uniformly
best region. Finally, if Ω consists of all real values θ, then
the use of the region R seems to be more reasonable than that
of R' or R''.

Since uniformly most powerful regions rarely exist, Neyman
and Pearson introduced a further principle on which the choice
of the critical region should be based, namely, the principle
of unbiasedness. A test is called unbiased if the power
function of the test has a relative minimum at the value
θ = θ_0, where θ = θ_0 is the hypothesis to be tested.

Some rationalization of this principle can be given: Suppose
a test is biased; then for some value θ_1 in the neighborhood
of θ_0, the power of the test is less than the size of the
region. But this means that the probability of rejecting the
hypothesis θ = θ_0 is larger if θ_0 is true than if θ_1 is
true, which is not a desirable situation.

In general, an infinity of unbiased tests exist, hence we
need a further principle in order to select a proper test from
among them. We define as a uniformly most powerful unbiased
test one which is at least as powerful, with respect to all
alternative hypotheses, as any other unbiased test of equal
size. If a uniformly most powerful unbiased test exists, and if
we accept the principle of unbiasedness, then it is obvious
that it is the most advantageous test to use. Neyman and
Pearson called a critical region corresponding to a uniformly
most powerful unbiased test a critical region of type A_1.

Referring to the example previously considered, the critical
region given by |x̄| > c is a region of type A_1 for testing
the hypothesis in question. Another example of a region of
type A_1 is the following: Let X_1,...,X_n be independently and
normally distributed with zero means and a common variance.
Then, for testing the hypothesis that the common variance σ² is
equal to σ_0², the critical region consisting of all points of
the sample space which satisfy at least one of the inequalities

      x_1^2 + \cdots + x_n^2 \leq c_1    or    x_1^2 + \cdots + x_n^2 \geq c_2

is a critical region of type A_1 if the constants c_1 and c_2
are properly chosen.

The region of type A_1 exists in an important, but very
restricted, class of cases; there are many instances in which
it does not exist. Therefore, Neyman and Pearson have
introduced a third type of region, known as a region of type A.
The region R is said to be of type A if its power function
P(R|θ) is such that

1)    \left. \frac{\partial P(R|\theta)}{\partial \theta} \right|_{\theta = \theta_0} = 0

and

2)    \left. \frac{\partial^2 P(R|\theta)}{\partial \theta^2} \right|_{\theta = \theta_0} \geq \left. \frac{\partial^2 P(R'|\theta)}{\partial \theta^2} \right|_{\theta = \theta_0}

for all regions R' which satisfy 1) and have the same size as R.
The first condition restricts the region to be unbiased. The
second requires the power function of a region of type A to have
a greater curvature than that of any other unbiased region of
the same size. To put it crudely, it means that the region is
most powerful in the neighborhood of θ_0.
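As a worked check of the first condition, take the earlier example of n independent normal observations with unknown mean θ and unit variance, with θ_0 = 0. The notation β(θ) for the power function, Φ and φ for the standard normal distribution function and density, and z, z' for the cut-off constants is introduced here for the sketch and does not come from the text.

```latex
% Two-sided region R: |\bar{x}| > z/\sqrt{n}
\begin{align*}
\beta(\theta) &= P(R \mid \theta)
   = \Phi(-z - \sqrt{n}\,\theta) + 1 - \Phi(z - \sqrt{n}\,\theta), \\
\beta'(\theta) &= \sqrt{n}\,\bigl[\varphi(z - \sqrt{n}\,\theta)
   - \varphi(-z - \sqrt{n}\,\theta)\bigr],
   \qquad \beta'(0) = 0, \\
\beta''(0) &= 2\,n\,z\,\varphi(z) > 0 .
\end{align*}
% One-sided region R': \bar{x} > z'/\sqrt{n}
\begin{align*}
\beta_{R'}(\theta) &= 1 - \Phi(z' - \sqrt{n}\,\theta), \qquad
\beta_{R'}'(0) = \sqrt{n}\,\varphi(z') \neq 0 .
\end{align*}
```

The two-sided region therefore satisfies condition 1) and has a relative minimum of its power at θ_0 = 0, while the one-sided region violates condition 1): its power falls below its size for θ slightly less than zero, which is exactly the bias described above.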

A critical region of type A exists under very weak conditions
which are fulfilled in most of the practical cases. However,
the objection can be raised against a region of type A
that we are much more concerned with the behavior of the power
function for alternatives θ which are far from θ_0 than for
those in the neighborhood of θ_0. In spite of this, as we will
see, a good justification of the use of a type A region can be
given in the light of some recent results.


III R. A. FISHER'S THEORY OF ESTIMATION⁵⁾


The problem of estimation of the unknown parameter θ is
the problem of finding a function t(x_1,...,x_n) of the
observations such that t can be considered in a certain sense
as a "good" or "best" estimate of θ. Since the estimate
t(x_1,...,x_n) is a random variable, we cannot expect that its
value should coincide with that of the unknown parameter, but
we will try to choose t(x_1,...,x_n) in such a way as to make
as great as possible the probability of the value of t lying as
near as possible to the value of the unknown parameter θ.

This is a somewhat vague formulation of the requirement
for a "good" or "best" statistical estimate. It can be made
precise in different ways. Markoff⁶⁾, for instance, defines
the notion of a "best" estimate as follows: A statistic t (we
shall call any function of the observations a statistic) is a
best estimate of θ if

(1) t is an unbiased estimate of θ, i.e., E_θ(t) = θ
    identically in θ, where E_θ(t) denotes the expected value of
    t under the assumption that θ is the true value of the
    parameter.

(2) E_θ(t - θ)² ≤ E_θ(t' - θ)² identically in θ for all t' which
    satisfy (1).

This definition of a "best estimate" seems to be a reasonable
and acceptable one since, in general, the smaller the variance
of t the greater is the probability that t will lie in a small
neighborhood of θ. It should be remarked that although (by
virtue of Tshebisheff's inequality) smallness of the variance
implies that the probability of t lying in a small neighborhood
of θ is high, the converse is not necessarily true. It
may happen that a statistic t has a large variance and,
nevertheless, the probability of t lying in a small
neighborhood of θ is high. This circumstance constitutes some
argument against Markoff's definition. A more serious
difficulty is, however, the fact that a best estimate in
Markoff's sense seldom exists.

5) See references 3 - 6.

6) See reference 15, p. 344.
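A small numerical illustration of both directions of this remark may help. The first function gives the lower bound that Tshebisheff's inequality places on the probability of an unbiased statistic lying within eps of θ. The second describes a purely hypothetical statistic, invented here for the example, which equals θ with probability 1 - q and lies at distance `far` on either side of θ with probability q/2 each.

```python
def chebyshev_lower_bound(variance, eps):
    """Lower bound on P(|t - theta| < eps) for an unbiased t with the given variance."""
    return max(0.0, 1.0 - variance / eps ** 2)


def contaminated_statistic(q, far):
    """Variance and concentration of a hypothetical unbiased statistic.

    t equals theta with probability 1 - q and theta + far or theta - far with
    probability q / 2 each. Returns (variance, P(|t - theta| < eps)) for any
    eps not exceeding `far`.
    """
    variance = q * far ** 2
    prob_near = 1.0 - q
    return variance, prob_near


if __name__ == "__main__":
    # Small variance forces high concentration.
    print(chebyshev_lower_bound(variance=0.0001, eps=0.1))
    # Large variance, yet the statistic is almost always exactly at theta.
    variance, prob_near = contaminated_statistic(q=0.001, far=1000.0)
    print(variance, prob_near, chebyshev_lower_bound(variance, eps=0.1))
```

In the second case the inequality gives no information at all (the bound is zero), although the statistic lies in every small neighborhood of θ with probability 0.999; this is the sense in which the variance can be a misleading measure of how good an estimate is.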

R. A. Fisher's theory of estimation is based on the
principle of the maximum likelihood. It is assumed that a
probability density

    p(x_1,...,x_n, θ)

exists in the sample space, i.e., for any measurable subset W
of the sample space

    P(W | θ) = ∫_W p(x_1,...,x_n, θ) dx.

In particular, the cumulative distribution function is
given by

    F(x_1,...,x_n, θ) = ∫_{-∞}^{x_1} ... ∫_{-∞}^{x_n} p(v_1,...,v_n, θ) dv_1 ... dv_n.

The maximum likelihood estimate θ̂_n(x_1,...,x_n) is defined as
that value of θ for which p(x_1,...,x_n, θ) becomes a maximum.
Now assume that X_1,...,X_n are n independently distributed
random variables each having the same distribution. This can
also be expressed by saying that x_1,...,x_n are n independent
observations on the same random variable X. The main result of
Fisher's theory of estimation can be stated as follows: If
x_1,...,x_n are n independent observations (n = 1,..., ad inf.)
on the same random variable X and if the distribution of X
satisfies certain conditions (which are not too restrictive and
in practical applications are frequently fulfilled), then θ̂_n
is an efficient estimate. The definition of an efficient
estimate is given as follows:

A sequence {t_n} (n = 1,..., ad inf.) of statistics is
called an efficient estimate of θ (the subscript n indicates
the number of observations of which t_n is a function) if

(1) the limit distribution of √n (t_n - θ) is a normal
    distribution with zero mean and finite variance, and

(2) for any sequence {t'_n} of statistics which satisfies (1)

        σ^2 / σ'^2 ≤ 1,

    where σ^2 = lim E_θ[√n (t_n - θ)]^2
    and  σ'^2 = lim E_θ[√n (t'_n - θ)]^2.

The ratio σ^2/σ'^2 is called the efficiency of {t'_n},
which is always ≤ 1.

Vaguely speaking, in large samples the maximum likelihood 
estimate has the smallest variance compared with any other 
statistic which is in the limit normally distributed. The re- 
striction of the comparison to statistics which are in the 
limit normally distributed seems to be a serious one. However, 
as recent results show, the maximum likelihood estimate has a
much stronger property than efficiency, and it can be
considered as a "best" large sample estimate of θ compared even
with statistics which are not normally distributed in the
limit. 7)


7) See reference 20

The question of consistency and limit distribution of the
maximum likelihood estimate has been treated by H. Hotelling
(see reference 7). A complete proof has been given by
J. L. Doob (see reference 1).

As an example, let x_1,...,x_n be n independent observations
on a normally distributed variate X with unknown mean θ and
unit variance. It can easily be verified that the maximum
likelihood estimate of θ is given by

    θ̂_n = (x_1 + ... + x_n) / n.

Let t'_n(x_1,...,x_n) be the median of the observations
x_1,...,x_n. It can be shown that the limit distribution of
√n (t'_n - θ) is normal with zero mean and variance π/2.
Hence, the efficiency of the median for estimating θ is equal
to 2/π = 0.6366...
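The efficiency of the median can be illustrated by a small sampling experiment. The following is a minimal sketch, not a proof: the sample size, the number of repetitions and the seed are arbitrary assumptions, and for a finite n the simulated ratio only approximates the limiting value 2/π.

```python
import math
import random
import statistics


def simulated_efficiency(n=101, trials=4000, theta=0.0, seed=0):
    """Estimate var(mean) / var(median) for samples of size n from N(theta, 1)."""
    rng = random.Random(seed)
    means, medians = [], []
    for _ in range(trials):
        sample = [rng.gauss(theta, 1.0) for _ in range(n)]
        means.append(statistics.fmean(sample))
        medians.append(statistics.median(sample))
    # Mean squared deviations about the true value theta.
    var_mean = statistics.pvariance(means, mu=theta)
    var_median = statistics.pvariance(medians, mu=theta)
    return var_mean / var_median


asymptotic_efficiency = 2 / math.pi
efficiency = simulated_efficiency()
print("limiting efficiency 2/pi:", round(asymptotic_efficiency, 4))
print("simulated efficiency    :", round(efficiency, 4))
```

In practical terms, the median needs roughly π/2 times as many observations as the arithmetic mean to reach the same large-sample variance.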


IV THE THEORY OF CONFIDENCE INTERVALS 


The procedure of estimation, as I formulated it here, is
also called estimation by a point. For practical applications
the estimation by intervals seems to be much more important.
That is to say, we have to construct two functions of the
observations θ̲(E) and θ̄(E), where E denotes a point of the
sample space, and we estimate the parameter to be within the
interval δ(E) = [θ̲(E), θ̄(E)]. In connection with the theory
of interval estimation, R. A. Fisher introduced the notion of
fiducial probability and fiducial limits, while Neyman 8)
developed the theory of interval estimation based on the
classical theory of probability. I shall give here a brief
outline of Neyman's theory.

Before the sample has been drawn the point E is a random
variable and, therefore, the values of θ̲(E) and θ̄(E) are also
random variables. Hence, before the sample has been drawn we
can speak of the probability that

(3)     θ̲(E) ≤ θ ≤ θ̄(E)

even if θ is considered merely as an unknown constant. After
the sample has been drawn and we have obtained a particular
sample point, say E_0, it does not make sense to speak of the
probability that

(4)     θ̲(E_0) ≤ θ ≤ θ̄(E_0),

if θ is merely an unknown constant. Each term in the
inequality (4) is a fixed constant, and the inequality (4) is
either right or wrong for those particular constants. It would
be proper to talk about the probability of (4) if θ itself
could be considered as a random variable having a certain
probability distribution, called an a priori probability
distribution. In this case we understand by the probability
that (4) holds the conditional probability, called also a
posteriori probability, under the assumption that E = E_0
occurred. If an a priori distribution of θ exists and if it is
known then, using Bayes' formula, we can easily calculate the
a posteriori probability distribution of θ. However, in
practical applications we seldom meet cases where the
assumption of the existence of an a priori probability
distribution seems to be justified; and even in those rare
cases in which the latter assumption can be made, we usually
do not know the shape of the a priori probability distribution
and this makes the application of Bayes' theorem impossible.
For these reasons the theory of interval estimation has to be
developed in such a way that its validity should not depend on
the existence of an a priori probability distribution. Hence,
in this theory we shall speak only of the probability of (3)
but never of the probability of (4).

8) See reference 15

For any relationship R we will denote by P[R | θ] the
probability of R calculated under the assumption that θ is the
true value of the parameter.

A pair of functions θ̲(E) and θ̄(E) is called a confidence
interval of θ if

1) θ̲(E) ≤ θ̄(E) for all points E
2) P[θ̲(E) ≤ θ ≤ θ̄(E) | θ] = α for all values of θ,

where α is a fixed constant called the confidence coefficient.

The practical meaning and importance of the notion of the
confidence interval is this: If a large number of samples are
drawn and if in each case we make the statement that θ is
included in the interval [θ̲(E), θ̄(E)], then the relative
frequency of correct statements will approximately be equal
to α.
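The frequency interpretation just described can be sketched by simulation. The example is an assumption made for illustration and is not taken from the text: the observations are normal with unit variance, the interval is the arithmetic mean plus or minus z/√n, and the confidence coefficient is 0.95. The point to observe is that the relative frequency of correct statements does not depend on the value of θ used to generate the samples.

```python
import random
from statistics import NormalDist, fmean


def coverage_frequency(theta, n=25, alpha=0.95, samples=5000, seed=1):
    """Relative frequency with which [mean - h, mean + h] covers theta."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf((1 + alpha) / 2)  # two-sided normal quantile
    half_width = z / n ** 0.5
    hits = 0
    for _ in range(samples):
        mean = fmean([rng.gauss(theta, 1.0) for _ in range(n)])
        if mean - half_width <= theta <= mean + half_width:
            hits += 1
    return hits / samples


# The true theta is unknown in practice; several values are tried here.
for theta in (-3.0, 0.0, 10.0):
    print(theta, coverage_frequency(theta))
```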

In general, there exist infinitely many confidence
intervals corresponding to a fixed confidence coefficient α,
and we have to set up some principle for choosing from among
them. It is obvious that we want the confidence interval
corresponding to a fixed confidence coefficient to be as
"short" as possible. We have to give a precise definition of
the notion "shortest" confidence interval.

A confidence interval δ(E) = [θ̲(E), θ̄(E)] is called a
shortest confidence interval corresponding to the confidence
coefficient α if

(a) P[θ̲(E) ≤ θ ≤ θ̄(E) | θ] = α and
(b) for any confidence interval δ'(E) = [θ̲'(E), θ̄'(E)] which
    satisfies (a)

        P[θ̲(E) ≤ θ' ≤ θ̄(E) | θ''] ≤ P[θ̲'(E) ≤ θ' ≤ θ̄'(E) | θ'']

    for all values θ' and θ'' of θ.
If a shortest confidence interval exists, it seems to be the 
most advantageous. Unfortunately, shortest confidence inter- 
vals exist only in quite exceptional cases. Therefore, we have 
to introduce some further principles on which the choice should 
be based. Such a principle is the principle of unbiasedness. 

A confidence interval δ(E) is called an unbiased confidence
interval corresponding to the confidence coefficient α if

    P[θ̲(E) ≤ θ ≤ θ̄(E) | θ] = α

and

    P[θ̲(E) ≤ θ' ≤ θ̄(E) | θ''] ≤ α  for all values θ' and θ''.

A confidence interval δ(E) is called a shortest unbiased
confidence interval corresponding to the confidence coefficient
α if δ(E) is an unbiased confidence interval with the confidence
coefficient α and if for any unbiased confidence interval δ'(E)
with the same confidence coefficient, we have

    P[θ̲(E) ≤ θ' ≤ θ̄(E) | θ''] ≤ P[θ̲'(E) ≤ θ' ≤ θ̄'(E) | θ'']

for all values θ' and θ''.

If we accept the principle of unbiasedness, the shortest 
unbiased confidence interval seems to be the most favorable one. 
Even shortest unbiased confidence intervals exist only in a 
restricted, but important, class of cases. If a shortest un- 
biased confidence interval does not exist, Neyman proposes the 
use of a third type of confidence interval, which he calls a
"short unbiased" confidence interval. An unbiased confidence
interval δ(E) with the confidence coefficient α is called a
short unbiased confidence interval if

    ∂²/∂θ''² P[θ̲(E) ≤ θ' ≤ θ̄(E) | θ''] ≤ ∂²/∂θ''² P[θ̲'(E) ≤ θ' ≤ θ̄'(E) | θ''],

both derivatives being taken at θ'' = θ',
for all θ' and for all unbiased confidence intervals δ'(E) with
the confidence coefficient α.

I have discussed only the case of a single unknown
parameter. In the case of several unknown parameters some new
problems arise, which do not occur in the case of a single
parameter. However, I shall not discuss them, since the case
of a single parameter already provides a good illustration of
the basic ideas of the theories of Fisher, Neyman and Pearson.


V ASYMPTOTICALLY MOST POWERFUL TESTS AND ASYMPTOTICALLY SHORTEST CONFIDENCE INTERVALS

As we have seen, if a uniformly most powerful (unbiased)
test and a shortest (unbiased) confidence interval exist, they
provide a satisfactory solution of the problem of testing a
hypothesis and the problem of interval estimation.
Unfortunately, they exist only in a restricted class of cases.
As substitutes for them the use of a critical region of type A
and a short confidence interval, respectively, have been
proposed. The appropriateness of the region of type A seems
somewhat doubtful, since we are more interested in the
behavior of the power function at values of θ far from the
value θ_0 to be tested than at values of θ near to θ_0.
Similar objections can be raised to the use of a short
confidence interval. Recent investigations show, however, that
the situation is much more favorable than appears at first
glance. It is shown that the difficulties arising because of
the non-existence of uniformly most powerful unbiased tests
and shortest unbiased confidence intervals gradually disappear
with increasing size of the sample, since so-called
asymptotically most powerful unbiased tests and asymptotically
shortest unbiased confidence intervals practically always
exist.

We shall assume that the observations x_1,...,x_n are n
independent observations on the same random variable X whose
distribution function involves a single unknown parameter θ.
We shall also assume that X has a probability density
function, say f(x,θ). Since in our discussions the number of
observations n will not be kept constant, we shall indicate
the dimension of the sample space by proper subscripts. For
instance, a critical region in the n-dimensional sample space
will be denoted by a capital letter with the subscript n. A
point of the n-dimensional sample space will be denoted by
E_n, and a confidence interval based on n observations by
δ_n(E_n).

9) See references 17-20
For any region U_n denote by G(U_n) the greatest lower
bound of P(U_n | θ). For any pair of regions U_n and T_n denote
by L(U_n, T_n) the least upper bound of

    P(U_n | θ) - P(T_n | θ).

A sequence {W_n} (n = 1,..., ad inf.) of regions is said to be
an asymptotically most powerful test of the hypothesis θ = θ_0
on the level of significance α if P(W_n | θ_0) = α and if for
any sequence {Z_n} of regions for which P(Z_n | θ_0) = α, the
inequality

    lim sup L(Z_n, W_n) ≤ 0

holds.

A sequence {W_n} (n = 1,..., ad inf.) of regions is said to be
an asymptotically most powerful unbiased test of the hypothesis
θ = θ_0 on the level of significance α if
P(W_n | θ_0) = lim G(W_n) = α
and if for any sequence {Z_n} of regions for which
P(Z_n | θ_0) = lim G(Z_n) = α the inequality
lim sup L(Z_n, W_n) ≤ 0 holds.

Let P_n(θ, α) be defined by

    P_n(θ, α) = l.u.b. P(Z_n | θ)

with respect to all regions Z_n for which P(Z_n | θ_0) = α. We
will call P_n(θ, α) the envelope function corresponding to the
level of significance α. Similarly let P*_n(θ, α) be the least
upper bound of P(Z_n | θ) with respect to all unbiased critical
regions Z_n which have the size α. We will call P*_n(θ, α) the
unbiased envelope function corresponding to the level of
significance α.

The two previously given definitions are equivalent to the
following two:

A sequence {W_n} of regions is said to be an asymptotically
most powerful test of the hypothesis θ = θ_0 on the level of
significance α if P(W_n | θ_0) = α and

    lim {P_n(θ, α) - P(W_n | θ)} = 0   (n → ∞)

uniformly in θ.

A sequence {W_n} of regions is said to be an asymptotically
most powerful unbiased test of the hypothesis θ = θ_0 on the
level of significance α if P(W_n | θ_0) = α and

    lim {P*_n(θ, α) - P(W_n | θ)} = 0   (n → ∞)

uniformly in θ.

Let θ̂_n(x_1,...,x_n) be the maximum likelihood estimate of θ
in the n-dimensional sample space. That is to say, θ̂_n denotes
the value of θ for which the product f(x_1,θ) f(x_2,θ) ...
f(x_n,θ) becomes a maximum. Let W'_n be the region defined by
the inequality √n (θ̂_n - θ_0) ≥ c'_n, W''_n defined by the
inequality √n (θ̂_n - θ_0) ≤ c''_n, and let W_n be defined by
the inequality |√n (θ̂_n - θ_0)| ≥ d_n. The constants d_n, c'_n,
c''_n are chosen in such a way that

    P(W'_n | θ_0) = P(W''_n | θ_0) = P(W_n | θ_0) = α.

It has been shown that under certain restrictions on the
probability density f(x,θ) the sequence {W'_n} is an
asymptotically most powerful test of the hypothesis θ = θ_0 if
θ takes only values ≥ θ_0. Similarly {W''_n} is an
asymptotically most powerful test if θ takes only values ≤ θ_0.
Finally {W_n} is an asymptotically most powerful unbiased test
if θ can take any real value.
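For a concrete case, the two-sided region defined through the maximum likelihood estimate can be written out explicitly. The sketch below assumes, purely for illustration, that X is normal with unknown mean θ and unit variance, so that the maximum likelihood estimate is the arithmetic mean and √n times its deviation from θ_0 is exactly normal. The function names are hypothetical, and alpha denotes the level of significance.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()


def critical_constant(alpha):
    """Constant d with P(|Z| >= d) = alpha for a standard normal Z."""
    return STANDARD_NORMAL.inv_cdf(1 - alpha / 2)


def in_critical_region(sample, theta0, alpha):
    """True if the sample point lies in the two-sided region W_n."""
    n = len(sample)
    mle = sum(sample) / n  # maximum likelihood estimate of the mean
    return abs(n ** 0.5 * (mle - theta0)) >= critical_constant(alpha)


def power(theta, theta0, n, alpha):
    """P(W_n | theta) when the observations are N(theta, 1)."""
    d = critical_constant(alpha)
    shift = n ** 0.5 * (theta - theta0)
    accept = STANDARD_NORMAL.cdf(d - shift) - STANDARD_NORMAL.cdf(-d - shift)
    return 1.0 - accept


for theta in (0.0, 0.2, 0.5):
    print(theta, round(power(theta, 0.0, 50, 0.05), 4))
```

The power equals the level of significance at θ = θ_0 and increases as θ moves away from θ_0 in either direction, which is the behavior required of an unbiased test.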




There are also other asymptotically most powerful tests.
Let W'_n be the region defined by the inequality

    (1/√n) Σ_{i=1}^{n} ∂/∂θ log f(x_i, θ_0) ≥ c'_n,

W''_n defined by the inequality

    (1/√n) Σ_{i=1}^{n} ∂/∂θ log f(x_i, θ_0) ≤ c''_n,

and W_n defined by the inequality

    |(1/√n) Σ_{i=1}^{n} ∂/∂θ log f(x_i, θ_0)| ≥ c_n,

where the constants c_n, c'_n and c''_n are chosen in such a
way that

    P(W'_n | θ_0) = P(W''_n | θ_0) = P(W_n | θ_0) = α.

Then {W'_n} is an asymptotically most powerful test of the
hypothesis θ = θ_0 if θ takes only values ≥ θ_0. Similarly,
{W''_n} is an asymptotically most powerful test if θ takes
only values ≤ θ_0. Finally {W_n} is an asymptotically most
powerful unbiased test if θ can take any real value.

The sequence {A_n(θ_0)} is an asymptotically most powerful
unbiased test of the hypothesis θ = θ_0, where A_n(θ_0) denotes
the critical region of type A for testing the hypothesis
θ = θ_0.

Since there are many asymptotically most powerful tests,
the question arises whether they are all equally good or
whether one can be preferred to another. It is clear that if
{W_n} and {W'_n} are two asymptotically most powerful unbiased
tests, then for sufficiently large n they are equally good. In
fact, for sufficiently large n both power functions P(W_n | θ)
and P(W'_n | θ) are in a small neighborhood of the unbiased
envelope function P*_n(θ, α). However, they may behave
differently in the sense that with increasing n one power
function, say P(W_n | θ), approaches the envelope function
faster than P(W'_n | θ) does. In such a case it seems
preferable to use W_n, especially if the sample is only
moderately large. If the sample is so large that both power
functions are in a small neighborhood of the envelope function,
then it is immaterial whether we use W_n or W'_n.

These considerations lead to the idea that it is preferable
to use that asymptotically most powerful (unbiased) test {W_n}
for which the approach of P(W_n | θ) to the envelope function
is, in a certain sense, fastest.

A region W_n is called a most stringent test of size α for
testing the hypothesis θ = θ_0 if P(W_n | θ_0) = α and

    l.u.b._θ [P_n(θ, α) - P(W_n | θ)] ≤ l.u.b._θ [P_n(θ, α) - P(Z_n | θ)]

for all Z_n for which P(Z_n | θ_0) = α. The abbreviation
l.u.b._θ means "least upper bound with respect to θ."

If W_n is for each n a most stringent test, its power
function will approach the envelope function, in a certain
sense, faster than any other power function. It seems,
therefore, desirable to use a most stringent test. A region of
type A is not exactly a most stringent test, but probably it is
quite near to it (this question has yet to be investigated),
and this would provide a very good justification for the use of a
type A region. The mathematical difficulties in finding
explicitly a most stringent test are considerable.

Let δ_n(E_n) = [θ̲(E_n), θ̄(E_n)] be an interval function and
denote by P[δ_n(E_n) C θ' | θ''] the probability that δ_n(E_n)
will cover θ' under the assumption that θ'' is the true value
of the parameter.

A sequence of interval functions {δ_n(E_n)} (n = 1,2,..., ad
inf.) is called an asymptotically shortest confidence interval
of θ if the following two conditions are fulfilled:

(a) P[δ_n(E_n) C θ | θ] = α for all values of θ.
(b) For any sequence of interval functions {δ'_n(E_n)}
    (n = 1,2,..., ad inf.) which satisfies (a), the least
    upper bound of

        P[δ_n(E_n) C θ' | θ''] - P[δ'_n(E_n) C θ' | θ'']

    with respect to θ' and θ'' converges to zero with n → ∞.


A sequence of interval functions {δ_n(E_n)} (n = 1,2,..., ad
inf.) is called an asymptotically shortest unbiased confidence
interval of θ if the following three conditions are fulfilled:

(a) P[δ_n(E_n) C θ | θ] = α for all values of θ.
(b) The least upper bound of P[δ_n(E_n) C θ' | θ''] with
    respect to θ' and θ'' converges to α with n → ∞.
(c) For any sequence of interval functions {δ'_n(E_n)} which
    satisfies the conditions (a) and (b), the least upper
    bound of

        P[δ_n(E_n) C θ' | θ''] - P[δ'_n(E_n) C θ' | θ'']

    with respect to θ' and θ'' converges to zero with n → ∞.


Let C_n(θ) be a positive function of θ such that the
probability that

    |(1/√n) Σ_{i=1}^{n} ∂/∂θ log f(x_i, θ)| ≤ C_n(θ)

is equal to a constant α under the assumption that θ is the
true value of the parameter. Denote by θ̲(E_n) the root in θ of
the equation

    (1/√n) Σ_{i=1}^{n} ∂/∂θ log f(x_i, θ) = C_n(θ)

and by θ̄(E_n) the root of

    (1/√n) Σ_{i=1}^{n} ∂/∂θ log f(x_i, θ) = -C_n(θ).

It has been shown that under some restrictions on f(x,θ) the
interval δ_n(E_n) = [θ̲(E_n), θ̄(E_n)] is an asymptotically
shortest unbiased confidence interval of θ corresponding to the
confidence coefficient α. This confidence interval is identical
with that given by Wilks. 10)

The definition of a shortest confidence interval underlying
Wilks' investigations is somewhat different from that of
Neyman, which has been used here. According to Wilks, a
confidence interval δ(E) is called shortest in the average if
the expectation of the length of δ(E) is a minimum. The main
result obtained by Wilks can be formulated as follows: The
confidence interval in question is asymptotically shortest in
the average compared with all confidence intervals the endpoints of


which are roots of an equation of the following type: 


B 


In the present investigation such a restriction is not made. 
The confidence interval in consideration is shown to be asymp- 
totically shortest compared with any unbiased confidence in- 
terval. 

Now let C_n(θ) be a positive function of θ such that the
probability that |θ̂_n - θ| ≤ C_n(θ) is equal to a constant α
under the assumption that θ is the true value of the parameter.
Denote by θ̲(E_n) the root in θ of the equation
θ̂_n - θ = C_n(θ) and by θ̄(E_n) the root of θ̂_n - θ = -C_n(θ).
Consider the interval δ_n(E_n) = [θ̲(E_n), θ̄(E_n)]. Under some
restrictions on the density f(x,θ), it can be shown that
δ_n(E_n) is an asymptotically shortest unbiased confidence
interval.

10) See reference 22

This is a much stronger property of the maximum likelihood
estimate than its efficiency and gives a justification of
the use of the maximum likelihood estimate also in the light of
Neyman's theory of estimation.


VI OUTLINE OF A GENERAL THEORY OF STATISTICAL INFERENCE


The theories of Fisher, Neyman and Pearson are restricted
in two respects. First, they consider only the problem of
testing a hypothesis and that of estimation by point or
interval. The second restriction is that only the case in which
Ω is a k-parameter family of distribution functions is
investigated. Both restrictions are serious from the point of
view of applications.

There are many important statistical problems which are
neither problems of testing a hypothesis, nor problems of
estimation. We have already given such an example in Section 1.
As a further illustration, let us consider the following case:
Let X_1,...,X_p be p independently and normally distributed
random variables with unit variances and unknown means
θ_1,...,θ_p. Furthermore, let x_{i1},...,x_{in} be n
independent observations on X_i (i = 1,2,...,p). Suppose we
test the hypothesis that θ_1 = ... = θ_p = 0, and decide to
reject this hypothesis on the basis of the pn observations
x_{iα} (α = 1,...,n; i = 1,...,p). In such cases we are usually
interested in knowing which mean values are not zero, i.e., we
wish to subdivide the set of p mean values θ_1,...,θ_p into two
subsets, such that one of them contains the mean values which
are zero and the other the mean values which are not zero. This
subdivision has to be done, of course, on the basis of the pn
observations x_{iα}. More precisely, we have to deal with the
following statistical problem: There exist 2^p different
subsets of the set (θ_1,...,θ_p). Denote these subsets by
ω_1,...,ω_{2^p}, respectively. Let H_k (k = 1,...,2^p) be the
hypothesis that the mean values contained in the set ω_k are
equal to zero and all other mean values are unequal to zero.
On the basis of the pn observations we have to decide which
hypothesis H_k from the set of the 2^p possible hypotheses
should be accepted. This problem cannot be considered as a
problem of testing a hypothesis nor a problem of estimation.
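The structure of this multiple decision problem can be sketched in a few lines. The enumeration of the 2^p hypotheses follows the text; the rule used to choose among them (declare a mean to be zero when √n times the absolute sample mean is below a fixed threshold) is only a hypothetical rule introduced for illustration and is not proposed in the text.

```python
from itertools import combinations


def all_hypotheses(p):
    """The 2**p subsets of {0,...,p-1}; a subset lists the means taken to be zero."""
    return [frozenset(c) for k in range(p + 1) for c in combinations(range(p), k)]


def decide(observations, threshold=1.96):
    """Hypothetical rule: observations[i] holds the n observations on X_i."""
    zero_means = set()
    for i, xs in enumerate(observations):
        n = len(xs)
        statistic = n ** 0.5 * sum(xs) / n
        if abs(statistic) < threshold:
            zero_means.add(i)
    return frozenset(zero_means)


hypotheses = all_hypotheses(3)
accepted = decide([[0.1, -0.1], [5.0, 5.2], [0.0, 0.0]])
print(len(hypotheses), "possible hypotheses")
print("means declared zero:", sorted(accepted))
```

Whatever rule is used, it assigns to every sample point exactly one of the 2^p hypotheses, which is why the problem fits neither the scheme of testing a single hypothesis nor that of estimation.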

A similar problem arises if we wish to classify a set of
regression coefficients into the class of non-zero and the
class of zero regression coefficients. In problems of
regression we often take it for granted that the regression in
question is a polynomial and we have to determine on the basis
of the observations the degree of the polynomial to be fitted.
That is to say, we have to decide on the basis of the
observations which hypothesis of the sequence of hypotheses
H_1, H_2, H_3, ..., H_n, ... should be accepted. The symbol H_n
(n = 1,2,...) denotes the hypothesis that the regression is a
polynomial of n-th degree. These examples illustrate
sufficiently the necessity of the extension of the theory of
statistical inference to the general case as formulated in
Section 1.

The case in which Ω cannot be represented as a k-parameter
family of distribution functions is quite important. As an
illustration, consider the following problem: Let (x_1,y_1),
..., (x_n,y_n) be n independent pairs of observations on a pair
(X,Y) of random variables. Suppose we wish to test the
hypothesis that X and Y are independently distributed and we do
not have any a priori knowledge about the joint distribution of
X and Y. In this case Ω consists of all distribution functions
F(x_1,y_1,...,x_n,y_n) which can be written in the form

    F(x_1,y_1,...,x_n,y_n) = Φ(x_1,y_1) Φ(x_2,y_2) ... Φ(x_n,y_n)

where Φ may be an arbitrary function. The subclass ω consists
of all distribution functions F(x_1,y_1,...,x_n,y_n) which can
be written in the form

    F(x_1,y_1,...,x_n,y_n) = φ(x_1)ψ(y_1) φ(x_2)ψ(y_2) ... φ(x_n)ψ(y_n).

Hence, Ω cannot be represented as a k-parameter family of
functions.

The problem given above as an illustration has been treated
by H. Hotelling and Margaret Pabst (see reference 8). Another
problem, where Ω is the class of all continuous distributions,
has been considered in a paper (see reference 21). We shall
give here an outline of a theory of statistical inference
dealing with the following general problem: 11)

Let X_1,...,X_n be a set of n random variables. It is known
that the joint probability distribution function F(x_1,...,x_n)
of X_1,...,X_n is an element of a certain class Ω of
distribution functions. Let S be a system of subclasses of Ω.
For each element ω of S denote by H_ω the hypothesis that the
true distribution F(x_1,...,x_n) of X_1,...,X_n is an element
of ω. Denote by H_S the system of all hypotheses corresponding
to all elements of S. Let x_i be the observed value of X_i
(i = 1,...,n). We have to decide by means of the observed
sample point E_n = (x_1,...,x_n) which hypothesis of the system
H_S of hypotheses should be accepted. That is to say, for each
hypothesis H_ω we have to determine a region of acceptance M_ω
in the n-dimensional sample space. The hypothesis H_ω will be
accepted if and only if the sample point falls in the region
M_ω. The regions M_ω and M_ω' are, of course, disjoint for
ω ≠ ω'. Furthermore, the sum Σ M_ω taken over all ω is equal to
the whole sample space. The statistical problem is that of the
proper choice of the system M_S of the regions of acceptance.

11) This theory has been developed in reference 16
for the case that Ω is a k-parameter family

The choice of the system M_S of regions of acceptance is
equivalent to the choice of a function ω(E_n) defined over all
points E_n of the sample space. The value of the function
ω(E_n) is an element of S determined as follows: Since the
elements of M_S are disjoint and since Σ M_ω is equal to the
whole sample space, for each point E_n there exists exactly one
element ω of S such that E_n is contained in M_ω. The value of
the function ω(E_n) is that element ω of S for which E_n is an
element of M_ω. Hence, we can replace M_S by the function
ω(E_n) and for each sample point E_n we decide to accept the
hypothesis H_ω(E_n). We will call ω(E_n) the statistical
decision function. Hence, the statistical problem is that of
choosing the statistical decision function ω(E_n).

The choice of ω(E_n) will essentially be affected by the
relative importance of the different possible errors we may
commit. We commit an error whenever we accept a hypothesis H_ω
and the true distribution is not an element of ω. We introduce
a weight function for the possible errors. The weight function
w[F,ω] is a real valued non-negative function defined for all
elements F of Ω and all elements ω of S, expressing the
relative importance of the error committed by accepting H_ω
when F is true. If F is an element of ω then w[F,ω] = 0,
otherwise w[F,ω] > 0. The question as to how the form of the
weight function w[F,ω] should be chosen is not a mathematical
nor statistical one. The statistician who wants to test certain
hypotheses must first determine the relative importance of all
possible errors and this will depend on the special purposes of
his investigation. If this is done, we shall in general be able
to give a more satisfactory answer to the question as to how
the statistical decision function should be chosen. In many
cases, especially in statistical questions concerning
industrial production, we are able to express the importance of
an error in monetary terms, that is, we can express the loss
caused by the error considered in terms of money. We shall also
say that w[F,ω] is the loss caused by accepting H_ω when F is
true.
Suppose that we make our decisions according to a
statistical decision function ω(E_n), and that the true
distribution is the element F(x_1,...,x_n) of Ω. Then the
expected value of the loss is obviously given by the Stieltjes
integral

(5)     ∫_{M_n} w[F, ω(E_n)] dF(E_n) = r[F],

where the integration is to be taken over the whole sample
space M_n. We shall call the expression (5) the risk of
accepting a false hypothesis when F is the true distribution
function.
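When the sample space is finite, the Stieltjes integral (5) reduces to a weighted sum, which makes the notion of risk easy to compute. The sketch below is a hypothetical discrete example and is not taken from the text: Ω contains two binomial distributions, S contains the two one-element subclasses, and the decision function, the weights 1 and 5, and all names are assumptions chosen for illustration.

```python
from math import comb


def binomial_pmf(n, p):
    """Probability of each sample point x = 0,...,n under Binomial(n, p)."""
    return {x: comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(n + 1)}


# Hypothetical class of admissible distributions and system S of subclasses.
DISTRIBUTIONS = {"F_low": binomial_pmf(3, 0.2), "F_high": binomial_pmf(3, 0.8)}
SUBCLASSES = {"omega_low": {"F_low"}, "omega_high": {"F_high"}}


def decision(x):
    """Statistical decision function: sample point -> accepted subclass."""
    return "omega_high" if x >= 2 else "omega_low"


def weight(f_name, omega):
    """Loss from accepting H_omega when the distribution f_name is true."""
    if f_name in SUBCLASSES[omega]:
        return 0.0
    return 5.0 if omega == "omega_low" else 1.0


def risk(f_name):
    """Expected loss under f_name: the discrete analogue of integral (5)."""
    pmf = DISTRIBUTIONS[f_name]
    return sum(weight(f_name, decision(x)) * prob for x, prob in pmf.items())


risk_function = {name: risk(name) for name in DISTRIBUTIONS}
print(risk_function)
```

Evaluating the risk for every element of Ω, as in the last two lines, gives the risk as a function of the unknown true distribution.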
Since we do not know the true distribution F we shall have to
study the risk r[F] as a function of F. We shall call this
function the risk function. Hence, the risk function is defined
over all elements F of Ω. The form of the risk function depends
on the statistical decision function ω(E_n) and on the weight
function w[F,ω]. In order to express this fact, we shall denote
the risk function associated with the statistical decision
function ω(E_n) and the weight function w[F,ω] also by

    r{F | ω(E_n), w[F,ω]}.


We introduce the following definitions:

Definition 1. Denote by ω(E_n) and ω'(E_n) two statistical
decision functions for the same system H_S of hypotheses. We
shall say that ω(E_n) and ω'(E_n) are equivalent relative to
the weight function w[F,ω] if the risk function
r{F | ω(E_n), w[F,ω]} is identically equal to the risk function
r{F | ω'(E_n), w[F,ω]}, i.e., for any element F of Ω we have

    r{F | ω(E_n), w[F,ω]} = r{F | ω'(E_n), w[F,ω]}.

Definition 2. Denote by ω(E_n) and ω'(E_n) two statistical
decision functions for the same system H_S of hypotheses. We
shall say that ω(E_n) is uniformly better than ω'(E_n) relative
to the weight function w[F,ω] if ω(E_n) and ω'(E_n) are not
equivalent and for each element F of Ω we have

    r{F | ω(E_n), w[F,ω]} ≤ r{F | ω'(E_n), w[F,ω]}.

Definition 3. A statistical decision function ω(E_n) is
said to be admissible relative to the weight function w[F,ω]
if no uniformly better statistical decision function exists
relative to the weight function considered.

First principle for the choice of the statistical decision
function. We choose a statistical decision function which is
admissible relative to the weight function considered.
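Definitions 1 to 3 can be tried out on a small table of risk functions. In the sketch below Ω is assumed to consist of only two distributions, so that each risk function is a pair of numbers; the four decision functions d1 to d4 and their risks are hypothetical. The check is restricted to the candidates listed, whereas admissibility in the sense of Definition 3 refers to all statistical decision functions.

```python
def equivalent(r1, r2):
    """Definition 1: the two risk functions agree for every F."""
    return all(a == b for a, b in zip(r1, r2))


def uniformly_better(r1, r2):
    """Definition 2: r1 is nowhere larger than r2 and not equivalent to it."""
    return not equivalent(r1, r2) and all(a <= b for a, b in zip(r1, r2))


def admissible(risk_table):
    """Definition 3, restricted to the decision functions in risk_table."""
    return [
        name
        for name, r in risk_table.items()
        if not any(uniformly_better(other, r) for other in risk_table.values())
    ]


# Hypothetical risks (r[F_1], r[F_2]) of four decision functions.
risk_table = {
    "d1": (0.1, 0.5),
    "d2": (0.3, 0.3),
    "d3": (0.3, 0.6),
    "d4": (0.1, 0.5),
}
print(admissible(risk_table))
```

Here d3 is excluded because d1 is uniformly better, while d1 and d2 both survive: neither is uniformly better than the other, so the first principle alone cannot choose between them.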

There can scarcely be given any argument against the
acceptance of the above principle for the selection of ω(E_n).
However, this principle does not lead in general to a unique
solution. There exist in general many admissible statistical
decision functions. We need a second principle for the choice
of a best admissible decision function.

The choice between two admissible decision functions ω(E_n)
and ω'(E_n) may be affected by the degree of our a priori
confidence in the truth of the different elements of Ω.
Suppose, for instance, that for a certain element F_1 of Ω we
have

    r{F_1 | ω(E_n), w[F,ω]} < r{F_1 | ω'(E_n), w[F,ω]},

for another element F_2 of Ω we have

    r{F_2 | ω(E_n), w[F,ω]} > r{F_2 | ω'(E_n), w[F,ω]},

and for any other element F ≠ F_1, ≠ F_2 we have

    r{F | ω(E_n), w[F,ω]} = r{F | ω'(E_n), w[F,ω]}.

If we have much greater a priori confidence in the truth of F_1
than in that of F_2, we will probably prefer ω(E_n) to ω'(E_n).
On the other hand, if we think a priori that F_2 is more likely
to be true than F_1, we may prefer ω'(E_n) to ω(E_n).

Suppose we can express our a priori degree of confidence
by a non-negative additive set function ρ(η) defined over a
certain system of subsets η of Ω, where ρ(Ω) = 1. That is to
say, the value of ρ(η) expresses the degree of our a priori
belief that the true distribution is an element of the subset
η. In such a case it seems very reasonable to consider a
decision function ω*(E_n) as "best" if the value of the
integral

    ∫_Ω r{F | ω(E_n), w[F,ω]} dρ

becomes a minimum for ω(E_n) = ω*(E_n). That is, we consider a
decision function ω*(E_n) as "best" if it minimizes a certain
weighted average of the risk function.

However, it is doubtful that a set function expressing our
a priori degree of belief can meaningfully be constructed.
Therefore, we prefer to formulate the notion of a "best"
decision function independently of such considerations.

Denote by r{ω(E_n), w[F,ω]} the least upper bound of
r{F | ω(E_n), w[F,ω]} with respect to F, where F may be any
element of Ω.

Definition 4. A decision function ω*(E_n) is said to be a
"best" decision function if r{ω(E_n), w[F,ω]} becomes a
minimum for ω(E_n) = ω*(E_n). (The weight function w[F,ω] is
considered fixed.)
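Definition 4 amounts to minimizing the largest value of the risk function. A minimal sketch, assuming a finite Ω and a finite list of candidate decision functions (both hypothetical, as are the names d1, d2, d3 and their risks):

```python
def worst_case_risk(risks):
    """Least upper bound of the risk function over a finite class of distributions."""
    return max(risks)


def best_decision(risk_table):
    """Candidate whose worst-case risk is smallest (Definition 4, finite case)."""
    return min(risk_table, key=lambda name: worst_case_risk(risk_table[name]))


# Hypothetical risks (r[F_1], r[F_2]) of three decision functions.
risk_table = {
    "d1": (0.1, 0.5),
    "d2": (0.3, 0.3),
    "d3": (0.3, 0.6),
}
print(best_decision(risk_table))
```

The rule uses no a priori weights on F_1 and F_2; it compares the candidates only through the largest risk each can incur.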

This definition of a "best" decision function seems to be 
& very reasonable one, although it is not the only possible one, 
One could reasonably define a decision function as "best" if it 
minimizes a certain weighted average of the risk function. 
However, there are certain properties of the "best" decision 
function according to definition 4, which seem to justify the 
use of that definition. One of the most important properties 
of a "best" decision function in the sense of definition 4 is
that the risk function is a constant, i.e., it has the same
value for all elements $F$ of $\Omega$. This has been shown in the
case that $\Omega$ is a k-parameter family of distributions, and the
weight function $w[F,\omega]$ and the distribution functions $F$ satisfy
certain restrictive conditions. The constancy of the risk function
seems to be very desirable from the point of view of applications
since this property makes it possible to evaluate the
exact magnitude of the risk associated with the statistical de- 
cision. In the theory of confidence intervals the confidence 
coefficient $\alpha$, i.e., the probability that the confidence
interval will cover the unknown parameter, is independent of the
value of the unknown parameter. This fact, which is considered
to be of basic importance in the theory of interval estimation,
is analogous to the constancy of the risk function in our general
theory, since $1-\alpha$ can be considered in a certain sense as
the risk associated with the interval estimation. (The quantity
$1-\alpha$ is exactly equal to the risk in the sense of our definition
if the weight function takes only the values 0 and 1.)
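The independence of the confidence coefficient from the unknown parameter can be checked numerically in a simple case. The sketch below assumes, purely for illustration, a normal population with known standard deviation and the usual symmetric interval around the sample mean; the sample size, the value 0.95, the trial values of the mean, and the function names are hypothetical. The weight function is taken to be 0 when the interval covers the true mean and 1 otherwise, so the risk is the probability of not covering it.

```python
from statistics import NormalDist

def coverage(mu, sigma=1.0, n=25, conf_coef=0.95):
    """Probability that [xbar - h, xbar + h] covers mu when xbar ~ N(mu, sigma^2 / n)."""
    se = sigma / n ** 0.5
    # Half-width chosen so that a standard normal lies in [-z, z] with probability conf_coef.
    h = NormalDist().inv_cdf((1 + conf_coef) / 2) * se
    xbar = NormalDist(mu, se)  # sampling distribution of the mean
    # The interval covers mu exactly when mu - h <= xbar <= mu + h.
    return xbar.cdf(mu + h) - xbar.cdf(mu - h)

def risk_zero_one(mu, **settings):
    """Risk under a weight function equal to 0 if the interval covers mu and 1 otherwise."""
    return 1.0 - coverage(mu, **settings)

# Evaluate the risk at several hypothetical true means.
RISKS = [risk_zero_one(mu) for mu in (-10.0, 0.0, 3.5, 100.0)]
```

Every entry of `RISKS` equals 0.05 up to rounding error, that is, $1-\alpha$ with $\alpha = 0.95$, whatever the true mean is taken to be.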

Finally, I should like to make some remarks about the re- 
lationship of the general theory as outlined here, to the parti- 
cular theory of uniformly most powerful and asymptotically most 
powerful tests which were discussed before. In the case of 
testing the simple hypothesis that the unknown distribution 
$F(x_1, \ldots, x_n)$ is equal to a particular distribution $F_0(x_1, \ldots, x_n)$,
the system $S$ of subsets of $\Omega$ consists only of two elements $\omega_1$
and $\omega_2$, where $\omega_1$ contains the single element $F_0$ and $\omega_2$ is the
complement of $\omega_1$ in $\Omega$. Hence, the decision function $\omega(E_n)$ can
assume merely the values $\omega_1$ and $\omega_2$. Let $M_{\omega_1}$ be the subset of
the sample space consisting of the points $E_n$ for which $\omega(E_n) = \omega_1$,
and let $M_{\omega_2}$ be the set of points $E_n$ for which $\omega(E_n) = \omega_2$. The
set $M_{\omega_1}$ is the complement of $M_{\omega_2}$ in the sample space. Obviously
the set $M_{\omega_2}$ is the critical region in the sense of the Neyman-Pearson
theory. It is easy to see that if for any $\alpha$ ($0 < \alpha < 1$) a
uniformly best critical region of size $\alpha$ for testing $F = F_0$
exists, then for any arbitrary weight function and for any
admissible (see definition 3) decision function $\omega(E_n)$, the set
$M_{\omega_2}$ will be a uniformly best critical region. In particular, the
set $M_{\omega_2}$ corresponding to the "best" decision function (see
definition 4) will be a uniformly best critical region. Hence, the
form of the weight function affects merely the size of the region
$M_{\omega_2}$ associated with the "best" decision function $\omega(E_n)$,
but it will always be a uniformly best critical region in the
sense of the Neyman-Pearson theory. Similar considerations
hold concerning asymptotically most powerful tests. Let the
sequence $\{W_n\}$ ($n = 1, 2, \ldots,$ ad inf.) of critical regions be an
asymptotically most powerful test for testing the simple hypothesis
$F = F_0$. Then for sufficiently large $n$ the region $W_n$ is
practically a uniformly best critical region and, therefore, it
will be an excellent approximation to the region which is "best"
in the sense of definition 4 irrespective of the shape of the
weight function of errors.
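The remark that the weight function affects only the size of the critical region can be illustrated with a small numerical sketch. Everything in it is an assumption made for the example: the observation is the number of successes in 10 Bernoulli trials, the hypothesis $F_0$ is p = 0.5, the alternatives are replaced by the two values p = 0.7 and p = 0.9, and the weight function assigns a fixed cost to a wrong rejection and a fixed cost to a wrong acceptance. Only regions of the form "reject when x >= c" are searched, since these are the non-randomized uniformly most powerful regions against p > 0.5 in this binomial case; randomized rules are ignored.

```python
from math import comb

N = 10                      # hypothetical sample size
P0 = 0.5                    # the simple hypothesis F_0
ALTERNATIVES = (0.7, 0.9)   # hypothetical finite stand-in for the complement of F_0

def prob_reject(p, c):
    """P(X >= c) for X ~ Binomial(N, p): the chance that the sample falls in the region."""
    return sum(comb(N, k) * p**k * (1 - p)**(N - k) for k in range(c, N + 1))

def sup_risk(c, cost_reject_true, cost_accept_false):
    """Largest risk over F_0 and the alternatives for the rule 'reject iff X >= c'."""
    risk_at_f0 = cost_reject_true * prob_reject(P0, c)
    risk_at_alt = max(cost_accept_false * (1 - prob_reject(p, c)) for p in ALTERNATIVES)
    return max(risk_at_f0, risk_at_alt)

def best_threshold(cost_reject_true, cost_accept_false):
    """Threshold c in 0..N+1 whose region has the smallest largest risk."""
    return min(range(N + 2),
               key=lambda c: sup_risk(c, cost_reject_true, cost_accept_false))

c_equal_costs = best_threshold(1.0, 1.0)
c_costly_rejection = best_threshold(5.0, 1.0)
size_equal_costs = prob_reject(P0, c_equal_costs)
size_costly_rejection = prob_reject(P0, c_costly_rejection)
```

With equal costs the search selects c = 7. Raising the cost of a wrong rejection to 5 moves it to c = 8, a region of the same form but of smaller size, in line with the statement that the weight function changes the size of the region and not its type.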

As we have seen, for building up a general theory of 
statistical inference, the following three steps have to be 
made: 

1. Formulation of the general problem of statistical 
inference. 

2. Definition of the "best" procedure for making statistical
decisions, i.e., definition of the "best" statistical
decision function.

3. Solution of the mathematical problem of calculating 
the "best" statistical decision function. 

The problem of statistical inference, as we have formulated 
it here, seems to be sufficiently broad to cover the problems in
practical applications. The second step will always be, to a
certain extent, arbitrary. The definition of "best" decision
function given here seems to be a satisfactory one. Moreover,
under certain restrictive conditions it has the important property
that the risk function associated with the "best" decision
function is constant, i.e., it has the same value for all elements
of $\Omega$. However, there may be other definitions of a
"best" decision function worth investigating. Decision functions
which minimize a certain average of the risk function may
be of special interest. Concerning step 3, there are many
mathematical problems as yet unsolved.